 supervised machine



Identification of Malicious Posts on the Dark Web Using Supervised Machine Learning

Filho, Sebastião Alves de Jesus, Bernardo, Gustavo Di Giovanni, Gabriel, Paulo Henrique Ribeiro, Zarpelão, Bruno Bogaz, Miani, Rodrigo Sanches

arXiv.org Artificial Intelligence

Given the constant growth and increasing sophistication of cyberattacks, cybersecurity can no longer rely solely on traditional defense techniques and tools. Proactive detection of cyber threats has become essential to help security teams identify potential risks and implement effective mitigation measures. Cyber Threat Intelligence (CTI) plays a key role by providing security analysts with evidence-based knowledge about cyber threats. CTI information can be extracted using various techniques and data sources; however, machine learning has proven promising. As for data sources, social networks and online discussion forums are commonly explored. In this study, we apply text mining techniques and machine learning to data collected from Dark Web forums in Brazilian Portuguese to identify malicious posts. Our contributions include the creation of three original datasets, a novel multi-stage labeling process combining indicators of compromise (IoCs), contextual keywords, and manual analysis, and a comprehensive evaluation of text representations and classifiers. To our knowledge, this is the first study to focus specifically on Brazilian Portuguese content in this domain. The best-performing model, using LightGBM and TF-IDF, was able to detect relevant posts with high accuracy. We also applied topic modeling to validate the model's outputs on unlabeled data, confirming its robustness in real-world scenarios.
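As a rough illustration of the TF-IDF text representation named in the abstract above, the weighting can be sketched in plain Python. This is not the authors' pipeline: the whitespace tokenizer, the smoothed IDF variant, and the three sample posts are assumptions made up for the example, and the LightGBM classifier that would consume these vectors is omitted.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Term frequency is the raw count divided by document length.
    IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1, which is
    one common variant (an assumption, not taken from the paper).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return weights


# Hypothetical forum posts, invented for illustration only.
posts = [
    "selling stolen card dumps",
    "selling used bike",
    "new exploit for sale",
]
vectors = tf_idf(posts)
```

Terms that occur in many posts (here "selling") receive a lower weight than terms confined to a single post (here "stolen"), which is what lets a downstream classifier focus on distinctive vocabulary.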



Who Attacks, and Why? Using LLMs to Identify Negative Campaigning in 18M Tweets across 19 Countries

Hartman, Victor, Törnberg, Petter

arXiv.org Artificial Intelligence

Negative campaigning is a central feature of political competition, yet empirical research has been limited by the high cost and limited scalability of existing classification methods. This study makes two key contributions. First, it introduces zero-shot Large Language Models (LLMs) as a novel approach for cross-lingual classification of negative campaigning. Using benchmark datasets in ten languages, we demonstrate that LLMs achieve performance on par with native-speaking human coders and outperform conventional supervised machine learning approaches. Second, we leverage this novel method to conduct the largest cross-national study of negative campaigning to date, analyzing 18 million tweets posted by parliamentarians in 19 European countries between 2017 and 2022. The results reveal consistent cross-national patterns: governing parties are less likely to use negative messaging, while ideologically extreme and populist parties -- particularly those on the radical right -- engage in significantly higher levels of negativity. These findings advance our understanding of how party-level characteristics shape strategic communication in multiparty systems. More broadly, the study demonstrates the potential of LLMs to enable scalable, transparent, and replicable research in political communication across linguistic and cultural contexts.


Physics-informed features in supervised machine learning

Lampani, Margherita, Guastavino, Sabrina, Piana, Michele, Benvenuto, Federico

arXiv.org Machine Learning

The intrinsic ill-posedness of this problem can be addressed within the framework of regularization theory (Kaipio & Somersalo 2006), i.e., as the problem of minimizing a non-linear functional made of the sum of two terms: a fitting term in which the empirical risk is assessed by means of a loss function, and a penalty term that allows generalization while controlling the complexity of the solution. Finally, a real positive regularization parameter that balances the trade-off between the two terms has to be chosen by means of some regularization algorithm (Engl et al. 1996). When described in a Hilbert space setting, a representer theorem (Schölkopf et al. 2001; De Vito et al. 2004) provides an analytical solution of the minimum problem that is given by the action of a feature-dependent kernel operator onto a vector whose components can be analytically determined by means of classical Tikhonov regularization (Tikhonov 1963). From an operational perspective, a feature-based supervised machine learning process works as follows. Given an archive of annotated descriptors of the physical phenomenon, named features, 1. A standardization procedure generates a corresponding archive of annotated standardized features that are re-scaled and made dimensionless.
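The regularization framework described in this excerpt can be written compactly as follows. The notation is an assumption introduced here for illustration (the excerpt gives no symbols): V is the loss function, H a reproducing kernel Hilbert space with kernel K, lambda the regularization parameter, and the penalty is taken to be the squared norm in H. The closed form for the coefficient vector holds for the square loss only.

```latex
% Regularized empirical risk: fitting term plus penalty term
\hat{f}_{\lambda} = \arg\min_{f \in \mathcal{H}}
  \frac{1}{n} \sum_{i=1}^{n} V\bigl(f(x_i), y_i\bigr)
  + \lambda \, \|f\|_{\mathcal{H}}^{2},
  \qquad \lambda > 0

% Representer theorem: the minimizer is a kernel expansion over the training features
\hat{f}_{\lambda}(x) = \sum_{i=1}^{n} \alpha_i \, K(x, x_i)

% Square loss V(f(x), y) = (f(x) - y)^2: Tikhonov solution for the coefficients
\boldsymbol{\alpha} = \bigl(\mathbf{K} + n \lambda \mathbf{I}\bigr)^{-1} \mathbf{y},
  \qquad \mathbf{K}_{ij} = K(x_i, x_j)
```

Larger values of lambda shrink the coefficients and favour smoother solutions, while smaller values fit the training data more closely.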


Predicting Mortality and Functional Status Scores of Traumatic Brain Injury Patients using Supervised Machine Learning

Steinmetz, Lucas, Maheshwari, Shivam, Kazanjian, Garik, Loyson, Abigail, Alexander, Tyler, Margapuri, Venkat, Nataraj, C.

arXiv.org Artificial Intelligence

Traumatic brain injury (TBI) presents a significant public health challenge, often resulting in mortality or lasting disability. Predicting outcomes such as mortality and Functional Status Scale (FSS) scores can enhance treatment strategies and inform clinical decision-making. This study applies supervised machine learning (ML) methods to predict mortality and FSS scores using a real-world dataset of 300 pediatric TBI patients from the University of Colorado School of Medicine. The dataset captures clinical features, including demographics, injury mechanisms, and hospitalization outcomes. Eighteen ML models were evaluated for mortality prediction, and thirteen models were assessed for FSS score prediction. Performance was measured using accuracy, ROC AUC, F1-score, and mean squared error. Logistic regression and Extra Trees models achieved high precision in mortality prediction, while linear regression demonstrated the best FSS score prediction. Feature selection reduced 103 clinical variables to the most relevant, enhancing model efficiency and interpretability. This research highlights the role of ML models in identifying high-risk patients and supporting personalized interventions, demonstrating the potential of data-driven analytics to improve TBI care and integrate into clinical workflows.
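The four evaluation measures listed in the abstract above can be sketched in plain Python. This is a minimal illustration under stated assumptions: binary labels coded as 0/1, ROC AUC computed by pairwise comparison of scores, and toy values invented for the example rather than taken from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores, positive=1):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Requires at least one example of each class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_squared_error(y_true, y_pred):
    """Average squared difference, used for regression targets."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values, invented for illustration only.
labels = [1, 0, 1, 1, 0]
predicted = [1, 0, 0, 1, 1]
scores = [0.9, 0.2, 0.4, 0.8, 0.6]
```

Accuracy, F1, and ROC AUC apply to a classification target such as mortality, whereas mean squared error suits a numeric target such as a score.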


Advancements In Heart Disease Prediction: A Machine Learning Approach For Early Detection And Risk Assessment

Ingole, Balaji Shesharao, Ramineni, Vishnu, Bangad, Nikhil, Ganeeb, Koushik Kumar, Patel, Priyankkumar

arXiv.org Artificial Intelligence

The primary aim of this paper is to comprehend, assess, and analyze the role, relevance, and efficiency of machine learning models in predicting heart disease risks using clinical data. While the importance of heart disease risk prediction cannot be overstated, the application of machine learning (ML) in identifying and evaluating the impact of various features on the classification of patients with and without heart disease, as well as in generating a reliable clinical dataset, is equally significant. This study relies primarily on cross-sectional clinical data. The ML approach is designed to enhance the consideration of various clinical features in the heart disease prognosis process. Some features emerge as strong predictors, adding significant value. The paper evaluates seven ML classifiers: Logistic Regression, Random Forest, Decision Tree, Naive Bayes, k-Nearest Neighbors, Neural Networks, and Support Vector Machine (SVM). The performance of each model is assessed based on accuracy metrics. Notably, the Support Vector Machine (SVM) demonstrates the highest accuracy at 91.51%, confirming its superiority among the evaluated models in terms of predictive capability. The overall findings of this research highlight the advantages of advanced computational methodologies in the evaluation, prediction, improvement, and management of cardiovascular risks. In other words, the strong performance of the SVM model illustrates its applicability and value in clinical settings, paving the way for further advancements in personalized medicine and healthcare.


Predictive Modeling for Breast Cancer Classification in the Context of Bangladeshi Patients: A Supervised Machine Learning Approach with Explainable AI

Islam, Taminul, Sheakh, Md. Alif, Tahosin, Mst. Sazia, Hena, Most. Hasna, Akash, Shopnil, Jardan, Yousef A. Bin, Wondmie, Gezahign Fentahun, Nafidi, Hiba-Allah, Bourhia, Mohammed

arXiv.org Artificial Intelligence

Breast cancer has rapidly increased in prevalence in recent years, making it one of the leading causes of mortality worldwide. Among all cancers, it is by far the most common. Diagnosing this illness manually requires significant time and expertise. Since detecting breast cancer is a time-consuming process, preventing its further spread can be aided by creating machine-based forecasts. Machine learning and Explainable AI are crucial in classification as they not only provide accurate predictions but also offer insights into how the model arrives at its decisions, aiding in the understanding and trustworthiness of the classification results. In this study, we evaluate and compare the classification accuracy, precision, recall, and F1 scores of five different machine learning methods using a primary dataset (500 patients from Dhaka Medical College Hospital). Five different supervised machine learning techniques, including decision tree, random forest, logistic regression, naive Bayes, and XGBoost, have been used to achieve optimal results on our dataset. Additionally, this study applied SHAP analysis to the XGBoost model to interpret the model's predictions and understand the impact of each feature on the model's output. We compared the accuracy with which several algorithms classified the data, and contrasted the results with other literature in this field. After final evaluation, this study found that XGBoost achieved the best model accuracy, at 97%.
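SHAP analysis, mentioned in the abstract above, attributes a prediction to individual features using Shapley values. The idea can be sketched with a brute-force computation in plain Python. This is not the SHAP library and not the authors' code: `toy_model`, the instance, and the all-zero baseline are hypothetical, and using a single baseline vector in place of a background dataset is a simplification.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values by averaging over all feature orderings.

    Each feature starts at its baseline value and is switched to the
    instance value one at a time; its contribution is the change in
    the prediction at the moment it is switched. The cost grows
    factorially with the number of features, so this only suits tiny
    examples.
    """
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = instance[i]
            value = predict(current)
            totals[i] += value - previous
            previous = value
    return [total / len(orderings) for total in totals]


def toy_model(x):
    # Hypothetical model: two additive terms plus one interaction term.
    return 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[0] * x[2]


instance = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
```

The attributions always sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split equally between the two features involved.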


Supervised Machine Learning and Physics based Machine Learning approach for prediction of peak temperature distribution in Additive Friction Stir Deposition of Aluminium Alloy

Mishra, Akshansh

arXiv.org Machine Learning

Additive friction stir deposition (AFSD) is a novel solid-state additive manufacturing technique that circumvents issues of porosity, cracking, and property anisotropy that plague traditional powder bed fusion and directed energy deposition approaches. However, correlations between process parameters, thermal profiles, and resulting microstructure in AFSD remain poorly understood. This hinders process optimization for properties. This work employs a framework combining supervised machine learning (SML) and physics-informed neural networks (PINNs) to predict peak temperature distribution in AFSD from process parameters. Eight regression algorithms were implemented for SML modeling, while four PINNs leveraged governing equations for transport, wave propagation, heat transfer, and quantum mechanics. Across multiple statistical measures, ensemble techniques like gradient boosting proved superior for SML, with the lowest MSE of 165.78. The integrated ML approach was also applied to classify deposition quality from process factors, with logistic regression delivering robust accuracy. By fusing data-driven learning and fundamental physics, this dual methodology provides comprehensive insights into tailoring microstructure through thermal management in AFSD. The work demonstrates the power of bridging statistical and physics-based modeling for elucidating AM process-property relationships.


Classification of Instagram fake users using supervised machine learning algorithms

Singh, Vertika, Tolasaria, Naman, Alpeshkumar, Patel Meet, Bartwal, Shreyash

arXiv.org Artificial Intelligence

In the contemporary era, online social networks have become integral to social life, revolutionizing the way individuals manage their social connections. While enhancing accessibility and immediacy, these networks have concurrently given rise to challenges, notably the proliferation of fraudulent profiles and online impersonation. This paper proposes an application designed to detect and neutralize such dishonest entities, with a focus on safeguarding companies from potential fraud. The user-centric design of the application ensures accessibility for investigative agencies, particularly the criminal branch, facilitating navigation of complex social media landscapes and integration with existing investigative procedures.